Monagas State
Leveraging Approximate Caching for Faster Retrieval-Augmented Generation
Bergman, Shai, Ji, Zhang, Kermarrec, Anne-Marie, Petrescu, Diana, Pires, Rafael, Randl, Mathis, de Vos, Martijn
Retrieval-augmented generation (RAG) enhances the reliability of large language model (LLM) answers by integrating external knowledge. However, RAG increases the end-to-end inference time since looking for relevant documents from large vector databases is computationally expensive. To address this, we introduce Proximity, an approximate key-value cache that optimizes the RAG workflow by leveraging similarities in user queries. Instead of treating each query independently, Proximity reuses previously retrieved documents when similar queries appear, reducing reliance on expensive vector database lookups. We evaluate Proximity on the MMLU and MedRAG benchmarks, demonstrating that it significantly improves retrieval efficiency while maintaining response accuracy. Proximity reduces retrieval latency by up to 59% while maintaining accuracy and lowers the computational burden on the vector database. We also experiment with different similarity thresholds and quantify the trade-off between speed and recall. Our work shows that approximate caching is a viable and effective strategy for optimizing RAG-based systems.
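The caching idea described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the names `ApproximateCache` and `retrieve`, the use of cosine similarity as the similarity measure, the FIFO eviction policy, and the default threshold and capacity values. The `vector_db_lookup` argument stands in for whatever expensive vector database query a real system would issue.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class ApproximateCache:
    """Hypothetical approximate key-value cache keyed by query embeddings."""

    def __init__(self, threshold=0.9, capacity=128):
        self.threshold = threshold  # assumed: minimum similarity that counts as a hit
        self.capacity = capacity    # assumed: maximum number of cached queries
        self._entries = []          # (embedding, documents) pairs, oldest first

    def get(self, query_embedding):
        """Return the documents of the most similar cached query, or None on a miss."""
        best_docs, best_sim = None, self.threshold
        for embedding, docs in self._entries:
            sim = cosine_similarity(query_embedding, embedding)
            if sim >= best_sim:
                best_docs, best_sim = docs, sim
        return best_docs

    def put(self, query_embedding, docs):
        """Store a new entry, evicting the oldest one when full (FIFO is an assumption)."""
        if len(self._entries) >= self.capacity:
            self._entries.pop(0)
        self._entries.append((query_embedding, docs))


def retrieve(query_embedding, cache, vector_db_lookup):
    """Serve from the cache when a similar query was seen; otherwise query the database."""
    docs = cache.get(query_embedding)
    if docs is not None:
        return docs, True
    docs = vector_db_lookup(query_embedding)
    cache.put(query_embedding, docs)
    return docs, False
```

In this sketch, raising `threshold` makes hits rarer but keeps the reused documents closer to what a fresh lookup would return, while lowering it saves more lookups at the cost of recall. This mirrors the speed versus recall trade-off the abstract mentions, without claiming to reproduce its measurements.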
Adapting Large Language Models via Reading Comprehension
Cheng, Daixuan, Huang, Shaohan, Wei, Furu
We explore how continued pre-training on domain-specific corpora influences large language models, revealing that training on the raw corpora endows the model with domain knowledge, but drastically hurts its prompting ability for question answering. Taking inspiration from human learning via reading comprehension--practice after reading improves the ability to answer questions based on the learned knowledge--we propose a simple method for transforming raw corpora into reading comprehension texts. Each raw text is enriched with a series of tasks related to its content. Our method, highly scalable and applicable to any pre-training corpora, consistently enhances performance across various tasks in three different domains: biomedicine, finance, and law. Notably, our 7B language model achieves competitive performance with domain-specific models of much larger scales, such as BloombergGPT-50B. Furthermore, we demonstrate that domain-specific reading comprehension texts can improve the model's performance even on general benchmarks, showing the potential to develop a general model across even more domains. Our model, code, and data will be available at https://github.com/microsoft/LMOps.
HaVQA: A Dataset for Visual Question Answering and Multimodal Research in Hausa Language
Parida, Shantipriya, Abdulmumin, Idris, Muhammad, Shamsuddeen Hassan, Bose, Aneesh, Kohli, Guneet Singh, Ahmad, Ibrahim Said, Kotwal, Ketan, Sarkar, Sayan Deb, Bojar, Ondřej, Kakudi, Habeebah Adamu
This paper presents HaVQA, the first multimodal dataset for visual question-answering (VQA) tasks in the Hausa language. The dataset was created by manually translating 6,022 English question-answer pairs, which are associated with 1,555 unique images from the Visual Genome dataset. As a result, the dataset provides 12,044 gold standard English-Hausa parallel sentences that were translated in a fashion that guarantees their semantic match with the corresponding visual information. We conducted several baseline experiments on the dataset, including visual question answering, visual question elicitation, text-only and multimodal machine translation.
The Venezuelans Trying to Escape Their Country Through Video Game Grunt Work
On a recent afternoon in Maracaibo, Venezuela, Alexander Marinez, who has short-cropped black hair and three-to-four-day stubble, sat in front of his computer tracking herbiboars in the mushroom forests on Fossil Island. He pressed down on his glowing mouse, the newest addition to his otherwise timeworn gaming setup. The pixelated character on his computer screen followed the tracks of a hedgehoglike creature with triangular tusks and herbs growing out of its back. Outside Marinez's one-story house, the sun bore down on the dirt road. His home lies about six miles away from the strait that connects the Caribbean Sea with Lake Maracaibo, one of the world's richest sources of oil. The character inspected a tunnel. Suddenly, the herbiboar appeared, and the character attacked, stunning it.
Pull out all the stops: Textual analysis via punctuation sequences
Darmon, Alexandra N. M., Bazzi, Marya, Howison, Sam D., Porter, Mason A.
Whether enjoying the lucid prose of a favorite author or slogging through some other writer's cumbersome, heavy-set prattle (full of parentheses, em dashes, compound adjectives, and Oxford commas), readers will notice stylistic signatures not only in word choice and grammar, but also in punctuation itself. Indeed, visual sequences of punctuation from different authors produce marvelously different (and visually striking) sequences. Punctuation is a largely overlooked stylistic feature in "stylometry", the quantitative analysis of written text. In this paper, we examine punctuation sequences in a corpus of literary documents and ask the following questions: Are the properties of such sequences a distinctive feature of different authors? Is it possible to distinguish literary genres based on their punctuation sequences? Do the punctuation styles of authors evolve over time? Are we on to something interesting in trying to do stylometry without words, or are we full of sound and fury (signifying nothing)?
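To make the notion of a punctuation sequence concrete, the sketch below strips a text down to its punctuation marks and computes how often each mark occurs. The set of marks in `PUNCTUATION_MARKS` and the function names are assumptions chosen for illustration; the paper's own choice of marks and features may differ. Note that this simple version also counts apostrophes inside words as marks.

```python
from collections import Counter

# Assumed set of marks for illustration; not taken from the paper.
PUNCTUATION_MARKS = set(".,;:!?()\"'-")


def punctuation_sequence(text):
    """Return the punctuation marks of `text` in order, with all words removed."""
    return [ch for ch in text if ch in PUNCTUATION_MARKS]


def mark_frequencies(sequence):
    """Return the relative frequency of each mark in a punctuation sequence."""
    total = len(sequence)
    if total == 0:
        return {}
    counts = Counter(sequence)
    return {mark: count / total for mark, count in counts.items()}


sample = "Well, yes; but (perhaps) no."
sequence = punctuation_sequence(sample)
print("".join(sequence))           # the sentence reduced to its punctuation
print(mark_frequencies(sequence))  # share of each mark in the sequence
```

Frequency profiles like this one could then be compared across authors or genres, which is the kind of question the abstract poses.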
1000 novels everyone must read: Science Fiction & Fantasy (part two)
When Haldeman returned from Vietnam, with a Purple Heart for the wounds he had suffered, he wrote a story about a pointless conflict that seems as if it will never end. It was set in space, and the enemies were aliens, but 18 publishers decided it was too close to home before St Martin's Press took a gamble. The book that "nobody wants to read" went on to win many prizes. It's not perfect - it's hard to take seriously a future in which heterosexuality is a perversion - but the anti-war message is as powerful as ever. Known for his intricate short stories and critically acclaimed mountaineering novel Climbers, Harrison cut his teeth on SF. In typical fashion, he writes space opera better than many who write only in the genre. For all its star travel and alien artefacts, scuzzy 25th-century spaceports and drop-out space pilots, Light is actually about twisting three plotlines as near as possible to snapping point. This is as close as SF gets to literary fiction, and literary fiction gets to SF. Jon Courtenay Grimwood Amateur stonemason, waterbed designer, reformed socialist, nudist, militarist and McCarthyite, Heinlein is one of the most interesting and irritating figures in American science fiction.